3 research outputs found
Expanding the Usage of Web Archives by Recommending Archived Webpages Using Only the URI
Web archives are a window to view past versions of webpages. When a user requests a webpage on the live Web, such as http://tripadvisor.com/where_to_travel/, the webpage may not be found, which results in a HyperText Transfer Protocol (HTTP) 404 response. The user may then search for the webpage in a Web archive, such as the Internet Archive. Unfortunately, if this page has never been archived, the user will not be able to view the page, nor will the user gain any information about other webpages in the archive with similar content, such as the archived webpage http://classy-travel.net. Similarly, if the user requests the webpage http://hokiesports.com/football/ from the Internet Archive, the user will find only the requested webpage and will not learn of other archived webpages with similar content, such as http://techsideline.com. In this research, we build a model for selecting and ranking possible recommended webpages at a Web archive. The goal is to enhance both HTTP 404 and HTTP 200 responses by surfacing webpages in the archive that the user may not know existed. First, we detect semantics in the requested Uniform Resource Identifier (URI). Next, we classify the URI using an ontology, such as DMOZ or any website directory. Finally, we filter and rank candidates based on several features, such as archival quality, webpage popularity, temporal similarity, and content similarity. We measure the performance of each step using different techniques, including calculating the F1 score to evaluate the different tokenization methods and the classification. We tested the model using human evaluation to determine if we could classify and find recommendations for a sample of requests from the Internet Archive’s Wayback Machine access log. Overall, when selecting the full categorization, reviewers agreed with 80.3% of the recommendations, far more often than they selected “do not agree” or “I do not know”.
This indicates that reviewers are more likely to agree with the recommendations when the full categorization is selected. When selecting only the first level, reviewers agreed with just 25.5% of the recommendations. This indicates that deeper-level categorization improves the performance of finding relevant recommendations.
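The first step of the model above, detecting semantics in the requested URI, depends on tokenizing the URI into candidate words. A minimal sketch of one plausible tokenization method follows; the function name, the delimiter rules, and the suffix filtering are our own illustration, not the paper's implementation:

```python
import re
from urllib.parse import urlparse

def tokenize_uri(uri: str) -> list[str]:
    """Split a URI into candidate word tokens: hostname pieces plus
    path segments broken on non-alphanumeric delimiters."""
    parsed = urlparse(uri)
    # Drop "www" and common top-level-domain suffixes from the hostname.
    host_parts = [p for p in parsed.netloc.split(".")
                  if p and p not in ("www", "com", "net", "org", "edu")]
    # Split the path on delimiters such as /, _, -, and dots.
    path_parts = re.split(r"[^A-Za-z0-9]+", parsed.path)
    return [t.lower() for t in host_parts + path_parts if t]

print(tokenize_uri("http://tripadvisor.com/where_to_travel/"))
# ['tripadvisor', 'where', 'to', 'travel']
```

The resulting tokens could then be matched against category labels in a directory such as DMOZ to classify the URI.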
Impact of URI Canonicalization on Memento Count
Quantifying the captures of a URI over time is useful for researchers to identify the extent to which a Web page has been archived. Memento TimeMaps provide a format to list mementos (URI-Ms) for captures along with brief metadata, like Memento-Datetime, for each URI-M. However, when some URI-Ms are dereferenced, they simply provide a redirect to a different URI-M (instead of a unique representation at the datetime), often one also present in the TimeMap. This implies that confidently obtaining an accurate count of the non-forwarding captures for a URI-R is not possible using a TimeMap alone, and that the magnitude of a TimeMap is not equivalent to the number of representations it identifies. In this work we discuss this particular phenomenon in depth. We also perform a breakdown of the dynamics of counting mementos for a particular URI-R (google.com) and quantify the prevalence of the various canonicalization patterns that exacerbate attempts at counting using only a TimeMap. For google.com we found that 84.9% of the URI-Ms result in an HTTP redirect when dereferenced. We expand on and apply this metric to TimeMaps for seven other URI-Rs of large Web sites and thirteen academic institutions. Using a ratio metric DI of the number of URI-Ms without redirects to those requiring a redirect when dereferenced, five of the eight large Web sites' and two of the thirteen academic institutions' TimeMaps had a ratio less than one, indicating that more than half of the URI-Ms in these TimeMaps result in redirects when dereferenced.

Comment: 43 pages, 8 figures
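The DI ratio described above can be computed once each URI-M in a TimeMap has been dereferenced and its HTTP status recorded. A minimal sketch follows; the function and the status-code representation are our own illustration, assuming 3xx statuses mark redirecting URI-Ms:

```python
def di_ratio(statuses: list[int]) -> float:
    """DI = (# URI-Ms served directly) / (# URI-Ms that redirect).

    `statuses` holds the HTTP status code observed when each URI-M
    in a TimeMap is dereferenced; 3xx codes count as redirects.
    """
    redirects = sum(1 for s in statuses if 300 <= s < 400)
    direct = len(statuses) - redirects
    if redirects == 0:
        return float("inf")  # every URI-M yields a unique representation
    return direct / redirects

# For google.com, 84.9% of URI-Ms redirected; per 1000 URI-Ms:
print(round(di_ratio([200] * 151 + [301] * 849), 3))
# 0.178 -- DI < 1, so more than half of the URI-Ms redirect
```

A DI below one, as for google.com here, means redirecting URI-Ms outnumber direct representations, so the TimeMap's length overstates the number of distinct captures.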
Dynamic load modeling for bulk load using synchrophasors with a wide area measurement system for smart grid real-time load monitoring and optimization
Bulk data modeling in a smart grid dynamic network has been performed using an automated load modeling tool (ALMT), an on-load tap changer, and exponential dynamic load modeling. However, studies have observed that a small parameter variation may lead to considerable variations in measuring grid big data. Therefore, this study presents a dynamic real-time load modeling, monitoring, and optimization method for bulk load. The case study was conducted on Sarawak Energy Berhad (SEB), Malaysia, using the grid system's real-time data and load modeling. The dynamic load model was obtained from the load response in the MATLAB/Simulink environment. This paper also includes new parameter estimations of the load composition at the selected bus. The simulation results of the load models were compared with recorded data over the time interval of a bus-tripping event. The least-square-error method was used to converge the estimated parameter values for the load composition, which were compared with the actual recorded data until optimized load models were achieved. This work is a significant contribution to utility research on identifying, monitoring, and optimizing the most appropriate representation of system loads.
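The least-square-error parameter estimation described above can be illustrated on the standard exponential load model P = P0 * (V/V0)^np, which becomes linear after a log transform. The sketch below uses synthetic data; the model form is the common exponential load model, while the data values and helper names are our own illustration, not SEB measurements:

```python
import math

def fit_exponential_load(voltages, powers, v0):
    """Least-squares fit of P = P0 * (V / v0) ** np_exp.

    Taking logs gives ln P = ln P0 + np_exp * ln(V / v0),
    a straight line fitted by ordinary least squares.
    """
    x = [math.log(v / v0) for v in voltages]
    y = [math.log(p) for p in powers]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    np_exp = (sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
              / sum((xi - x_mean) ** 2 for xi in x))
    p0 = math.exp(y_mean - np_exp * x_mean)
    return p0, np_exp

# Synthetic load response: P0 = 100 MW, np = 1.5 (between constant-current
# and constant-impedance behavior), sampled at five bus voltages.
v0 = 1.0
voltages = [0.90, 0.95, 1.00, 1.05, 1.10]
powers = [100.0 * (v / v0) ** 1.5 for v in voltages]
p0, np_exp = fit_exponential_load(voltages, powers, v0)
print(round(p0, 2), round(np_exp, 2))
# 100.0 1.5
```

With noisy synchrophasor measurements the fit would not be exact, and the estimated parameters would be iteratively compared against recorded data, as the abstract describes.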